


Wine is a globally popular alcoholic beverage made from different varieties of grapes (red and white) by fermenting the sugar in grape juice into ethanol and carbon dioxide (CO2), releasing heat. Different grape varieties and yeast strains produce different styles of wine. These variations result from the complex interactions between the biochemical development of the grape, the reactions involved in fermentation, the terroir (the environmental characteristics of the crop), and the production process.
We will perform an exploratory data analysis (EDA) of the available wine dataset to derive insights into how different characteristics drive the quality and taste of the wine, how these features are interlinked, and how quality varies with their levels.
import numpy as np # Numpy is Python Package to work with numerical Data
import pandas as pd # Pandas Package will assist in working with Dataframes
pd.set_option('mode.chained_assignment', None) # To suppress pandas warnings.
pd.set_option('display.max_colwidth', None) # To display all the data in each column (None replaces -1, which is deprecated since pandas 1.0)
pd.options.display.max_columns = 50 # To display every column of the dataset in head()
import warnings
warnings.filterwarnings('ignore') # To suppress all the warnings in the notebook.
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style='whitegrid', font_scale=1.3, color_codes=True) # To apply seaborn styles to the plots.
The dataset is provided by INSAID and taken from their GitHub repository. Reference link: https://github.com/insaid2018/Term-1/blob/master/Data/Projects/winequality.csv
df_wine=pd.read_csv('https://raw.githubusercontent.com/insaid2018/Term-1/master/Data/Projects/winequality.csv')
df_wine.head(5) ### Displaying the First Five Rows of the data.
The dataset input variables (based on physicochemical tests) are:
print("Shape of the wine dataset {s}".format(s=df_wine.shape)) ## Printing the shape of the dataset
df_wine.info()
This has given us the preview of our data which is:
df_wine.describe()
Although the summary above indicates no missing values, we can confirm this by counting the null values in each variable.
df_wine.isnull().sum()
This confirms there are no missing values in our dataset.
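For contrast, here is a minimal sketch of what the same check reports when values are missing. The small frame `toy` is made-up illustration data, not the wine dataset, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the wine data) with two deliberately missing values.
toy = pd.DataFrame({
    "pH": [3.2, np.nan, 3.4, 3.1],
    "alcohol": [9.5, 10.1, np.nan, 12.0],
    "quality": [5, 6, 6, 7],
})

# Number of missing values per column; a clean dataset gives 0 everywhere.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Fraction of rows that contain at least one missing value.
rows_with_missing = toy.isnull().any(axis=1).mean()
print(rows_with_missing)
```

A non-zero count in any column would be the signal to impute or drop rows before going further.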
Pandas Profiling gives an illustrative overview of the dataset and helps identify which variables are worth exploring for insights.
import pandas_profiling # import pandas_profiling package
profile=df_wine.profile_report(title='Pandas Profiling of Wine Dataset')
profile.to_file(output_file="profiling_wine_dataset.html") # Storing the Profile in separate HTML File
df_wine.profile_report(title='Pandas Profiling of Wine Dataset', style={'full_width':True}) # To display in notebook.
Profiling has shown that the dataset is clean, without abnormal, missing or negative values. It also indicates that we have 11 predictor variables and one response variable, 'quality'.
During our data exploration and analysis, we will see how quality is affected by the other variables, which is also our problem statement.
Let's rename the columns to get rid of the spaces in their names. This will make the dataset easier to work with later.
df_wine.rename(columns={'fixed acidity': 'fixed_acidity','citric acid':'citric_acid','volatile acidity':'volatile_acidity','residual sugar':'residual_sugar','free sulfur dioxide':'free_sulfur_dioxide','total sulfur dioxide':'total_sulfur_dioxide'}, inplace=True)
df_wine.head(n=5) ### Display dataset with updated column names.
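As an alternative to listing every column by hand, the same renaming can be sketched generically with the string methods on the column index. The frame `toy` below is a made-up stand-in with column names in the same style as the wine data.

```python
import pandas as pd

# Hypothetical toy frame with the same style of column names as the wine data.
toy = pd.DataFrame({"fixed acidity": [7.4], "citric acid": [0.0], "pH": [3.51]})

# Replace every space in every column name with an underscore.
toy.columns = toy.columns.str.replace(" ", "_", regex=False)
print(list(toy.columns))
```

This version keeps working if more columns with spaces are added, at the cost of being less explicit than a rename dictionary.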

Univariate analysis (or visualization) gives insight into the distribution of each variable and provides statistics for each variable individually.
Let's plot a histogram for each variable to understand its underlying distribution.
df_wine.hist(figsize=(20,15),color='skyblue')
plt.show()
To check the shape of each variable's distribution, it is good practice to plot it and look for skewness. A kernel density estimate (KDE) is a useful tool for plotting the shape of a distribution.
sns.set(rc={'figure.figsize':(5,4)})
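for i in df_wine.columns:
    # distplot is deprecated since seaborn 0.11; sns.histplot(df_wine[i], kde=True) is the newer equivalent
    sns.distplot(df_wine[i])
    plt.title(i)
    plt.show()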
The 'pH' column appears to be approximately normally distributed.
All the other independent variables are right-skewed (positively skewed).
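The visual impression of skewness can also be checked numerically. The sketch below uses synthetic data (a normal and a log-normal sample generated with a fixed seed, not the wine columns) to show how `DataFrame.skew()` separates a symmetric variable from a right-skewed one; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic illustration data, not the wine dataset.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=3.2, scale=0.15, size=2000),         # bell-shaped
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=2000),   # long right tail
})

# Sample skewness per column: close to 0 for symmetric data, positive for a right tail.
skewness = toy.skew()
print(skewness)
```

Running the same `.skew()` call on the numeric wine columns would give one number per feature to set beside the plots.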
The main feature of interest is quality, against which all other features can be measured. Other potentially important features are alcohol, acidity, sulfur dioxide and pH.
Check the unique values of quality.
df_wine['quality'].unique() #### Checking unique values within the quality variable.
'''
Checking how the data is distributed among these unique quality values
and arranging them in ascending order
'''
df_wine.quality.value_counts().sort_index()
fig = plt.figure(figsize = (10,6))
plt.hist(df_wine["quality"].values, range=(1, 10),color='darkred')
plt.xlabel('Ratings of wines')
plt.ylabel('Count')
plt.title('Distribution of wine ratings')
plt.show()
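We can see that quality ranges from 3 to 9, with most observations at 5 or 6 and only rare high-quality observations at 8 or 9.
Based on the quality values, we can group the wines as follows: quality of 4 or below is 'bad', quality of 7 or above is 'good', and everything in between (5 and 6) is 'average'.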
'''
Adding a new column 'rating' to our dataset
Bifurcating Quality column based on values mentioned above.
Defining good and bad and keep all others in average category as default
'''
df_wine['rating']=np.select([(df_wine['quality'] <= 4),(df_wine['quality'] >= 7)],['bad','good'],default='average')
# Displaying data distribution in 'rating' column
df_wine['rating'].value_counts().plot(kind='pie',autopct='%5.1f%%',wedgeprops=dict(width=0.15),
figsize=(10, 8), fontsize=11,
startangle=20, shadow=True, cmap='binary')
plt.title('Rating Distribution',fontsize=16)
The donut chart above shows that the majority of the data falls under the "average" category.
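The same three-way split can also be sketched with `pd.cut`, which makes the bin edges explicit. The quality values below are a made-up sample covering the 3 to 9 range, and the bin edges are an assumption chosen to match the `np.select` rules (4 or below is bad, 7 or above is good).

```python
import numpy as np
import pandas as pd

# Hypothetical sample of quality scores covering the observed range.
quality = pd.Series([3, 4, 5, 6, 7, 8, 9])

# The np.select approach: the first matching condition wins, otherwise the default applies.
rating_select = np.select([quality <= 4, quality >= 7], ["bad", "good"], default="average")

# pd.cut with right-closed intervals (0, 4], (4, 6], (6, 10].
rating_cut = pd.cut(quality, bins=[0, 4, 6, 10], labels=["bad", "average", "good"])

print(list(rating_select))
print(list(rating_cut.astype(str)))
```

`pd.cut` returns an ordered categorical, which keeps the bad/average/good order in later plots and group-bys.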
mycor=df_wine.select_dtypes(include='number').corr() # Correlate numeric columns only; 'rating' holds text labels
plt.subplots(figsize=(10,10))
sns.heatmap(mycor,annot=True,cmap='RdPu_r',vmax=1)
plt.title('Heat Map showing different Correlation Among Variables',fontsize=16)
# Let's check the correlation values of quality with other variables.
mycor['quality'].sort_values(ascending=False)
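To make explicit what these numbers mean, the sketch below computes the Pearson correlation by hand and compares it with pandas. The six rows in `toy` are invented for illustration and are not taken from the wine dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: alcohol rises and density falls as quality rises.
toy = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 11.0, 12.0, 12.5],
    "density": [0.999, 0.998, 0.997, 0.995, 0.993, 0.992],
    "quality": [5, 5, 6, 6, 7, 7],
})

# Correlation of every feature with quality, excluding quality itself.
corr_with_quality = toy.corr()["quality"].drop("quality").sort_values(ascending=False)
print(corr_with_quality)

# Pearson r by hand: covariance of the centred columns over the product of their spreads.
x = toy["alcohol"] - toy["alcohol"].mean()
y = toy["quality"] - toy["quality"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
print(r_manual)
```

Dropping the target's correlation with itself keeps the trivial value of 1 out of the ranking.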
sns.jointplot(x=df_wine["quality"],y=df_wine["pH"],kind='scatter',color='red')
plt.title('Variation of Quality with PH level',loc='left')
As the plot above shows, the majority of wines fall into a pH window between 3.3 and 3.6, and not just the good-quality ones. This suggests that this pH range is a common basic feature of wine rather than an indicator of good wine.
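bx = sns.catplot(x='quality', y='alcohol', data=df_wine, kind='point', color='red') # catplot replaces the deprecated factorplot; kind='point' was factorplot's default
bx.set(xlabel='Wine Ratings', ylabel='Alcohol Percentage', title='Alcohol percentage (ABV) in different types of Wine ratings')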
The graph above shows that good-quality wines have around 11.5-12.5% alcohol, while bad-quality wines have less, with an average of around 10%.
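Values read off a plot can be backed by a grouped summary. The sketch below is a minimal example on invented numbers (not the wine data) showing how the median, mean and count of alcohol per quality level could be tabulated.

```python
import pandas as pd

# Hypothetical toy data: a handful of wines with a quality score and alcohol percentage.
toy = pd.DataFrame({
    "quality": [4, 4, 5, 5, 6, 6, 7, 7, 8],
    "alcohol": [9.8, 10.2, 9.5, 10.1, 10.4, 11.0, 11.4, 12.0, 12.6],
})

# One row per quality level with the median, mean and number of wines.
alcohol_by_quality = toy.groupby("quality")["alcohol"].agg(["median", "mean", "count"])
print(alcohol_by_quality)
```

Including the count matters here because the extreme quality levels have few wines, so their averages are less reliable.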
sns.jointplot(x=df_wine["quality"],y=df_wine["total_sulfur_dioxide"],kind='scatter',color='red')
SO2, or sulphur dioxide, is widely used in winemaking because it slows oxidation and kills harmful bacteria; however, having more sulphites does not make a wine better in quality. Average wines have 250-350 ppm of sulphur dioxide, bad-quality wines have higher levels, some above 400, and good wines have lower levels, as low as 90. So it is evident that the use of sulphur dioxide needs to be moderated.
Let's analyse some negative correlations.
sns.jointplot(x=df_wine["quality"],y=df_wine["residual_sugar"],kind='scatter',color='red')
plt.title('Quality Vs Residual Sugar',loc='left')
residual_sugar is the sugar left over after the yeast has acted on it during fermentation. The graph above shows that the lower the residual_sugar, the better the quality of the wine. Most of the average wines also have low residual_sugar. However, the bad wines have low residual_sugar as well, so we cannot conclude that this is the most promising factor for a good wine, although its influence is considerable.
sns.boxplot(x='quality',y='fixed_acidity',data=df_wine,fliersize=5)
As mentioned above, fixed_acidity has a negative correlation with quality. Most of the bad-quality wines have higher fixed_acidity values than the good-quality wines.
sns.jointplot(x=df_wine["density"],y=df_wine["alcohol"],kind='kde',color='red')
plt.title('Density Plot showing variation of Alcohol with Density of Wine',loc='left')
As we can see, alcohol has a strong negative correlation with density: the higher the alcohol percentage, the lower the density.
bx = sns.violinplot(x="rating", y='citric_acid', data = df_wine)
bx.set(xlabel='Wine Ratings', ylabel='Citric Acid', title='Citric_Acid in different types of Wine ratings')
sns.regplot(data=df_wine, x='quality', y='volatile_acidity', color='brown')
plt.title('Regplot showing the variation of quality with volatile Acidity')
Volatile acidity (VA) is a measure of the wine's volatile (or gaseous) acids. The primary volatile acid in wine is acetic acid, which is also the primary acid associated with the smell and taste of vinegar. The plot above shows that good-quality wines have low volatile acidity, which suggests that they taste and smell better.
sns.pairplot(df_wine,vars=['alcohol','sulphates','citric_acid','fixed_acidity','density'],hue='quality',diag_kind='kde')
The pair plots above show the relationships among the different variables, with overlapping quality levels.
In the following section, we analyze further how combinations of other variables relate to quality.
sns.set(rc={'figure.figsize':(20,8.27)}) # Set the figure size before plotting so that it applies to this plot
sns.scatterplot(x='fixed_acidity',y='volatile_acidity',hue='rating',data=df_wine,palette="husl")
plt.title('Quality based on Volatile_acidity and Fixed_acidity',fontsize=16)
The plot above shows that the majority of average-quality wines have fixed_acidity below 10, with a few exceptions. Most of the good and average wines also have low volatile_acidity (below 0.8), whereas bad wines tend to have higher volatile acidity. Since the dataset mixes white and red wines, we cannot conclude which bad-quality wines have low fixed acidity or high volatile acidity. Based on research, it can be assumed that white wines exhibit more fixed acidity, which makes them sparkle and gives them crispness on the palate. Hence, the good wines with high fixed acidity may be white wines, and those with lower fixed acidity and volatile acidity may be good red wines.
sns.set(rc={'figure.figsize':(20,8.27)}) # Set the figure size before plotting so that it applies to this plot
sns.scatterplot(x='alcohol',y='density',hue='rating',data=df_wine,palette="husl")
plt.title('Quality based on Density and Alcohol',fontsize=16)
The scatter plot above shows that the majority of the good wines have a density lower than 1.00 and an alcohol content between 12% and 14%. Average wines, on the other hand, have an alcohol content between 8% and 12%. We can also see that alcohol and density are strongly negatively correlated.
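Statements such as "the majority of good wines" can be quantified with boolean masks. The sketch below uses six invented rows (not the wine data) to compute the share of good wines under a density of 1.00 and inside the 12-14% alcohol band.

```python
import pandas as pd

# Hypothetical toy data with a rating label per wine.
toy = pd.DataFrame({
    "rating": ["good", "good", "good", "average", "average", "bad"],
    "density": [0.991, 0.993, 1.001, 0.996, 0.999, 0.998],
    "alcohol": [12.5, 13.0, 11.0, 10.0, 9.5, 9.0],
})

good = toy[toy["rating"] == "good"]

# The mean of a boolean mask is the fraction of rows where the condition holds.
share_low_density = (good["density"] < 1.0).mean()
share_alcohol_band = good["alcohol"].between(12, 14).mean()  # both ends inclusive
print(share_low_density, share_alcohol_band)
```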
The objective of this EDA was to analyze and understand which wine features may have the most impact on wine quality, which was summarized into 3 categories: bad, average and good. The majority of the dataset (76%) falls under 'average' quality, good wines make up 20% of the sample, and bad wines around 4%. Although the dataset was clean, with no missing or negative values, it is imbalanced across the quality categories.
The dataset also has a few limitations; for example, it lacks information on the category of wine, such as red or white. Because the sample is mixed, it was difficult to draw conclusions about quality from the features, as white and red wines vary differently because of the reactions of their base ingredients. For example, white wine may have slightly higher fixed_acidity, which gives it a crisper taste, and may also have a high alcohol content; that is why we see quite a few exceptions in our data analysis.
We should add more variables, such as wine category, weather conditions and specifics of the winemaking process, to derive more valuable business insights.
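Class shares like the ones quoted above can be computed directly from the rating column. The sketch below uses a made-up series of 20 labels (not the wine data) to show the calculation.

```python
import pandas as pd

# Hypothetical rating labels: 15 average, 4 good, 1 bad.
rating = pd.Series(["average"] * 15 + ["good"] * 4 + ["bad"] * 1)

# Share of each class in percent, rounded to one decimal place.
share = rating.value_counts(normalize=True).mul(100).round(1)
print(share)
```

A strongly dominant class, as in this toy sample, is a reminder that averages over the whole dataset mostly describe that class.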
